Visualize Dataset statistics in metadata panel #1472
Conversation
For the overflow case, I thought 29,000 rows would be displayed as 29K? Or is that just a forced example to show the overflow behaviour? Could you provide a screenshot of the skeleton loader in action?
I wanted to show you how the overflow looks. If you want it on a single row, I will implement the row and column short forms (like 29K). During our call we thought this would not show the exact number (unless we add a tooltip or something similar). If the two rows are not bad, we can keep it this way for now.
A skeleton loader is not needed for this implementation, as the dataset statistics are pre-calculated. However, we might need one as a general improvement for the API call, which is not related to this specific ticket. I will create a new ticket to track the skeleton loader implementation.
I've done a first review and left initial comments. It's a lot of changes, so I'll do a second review ASAP!
In terms of using hooks to get the file size, you could potentially do something like:

```python
@hook_impl
def after_catalog_created(self, catalog: DataCatalog):
    for ds_name, ds in catalog.datasets.items():
        try:
            fp = ds._filepath  # private attribute; not part of the public API
        except AttributeError:
            logger.error("Could not get the filepath for dataset %s", ds_name)
```

This is super rough and I haven't tried it properly, but it's a way to get the filepath for each dataset in the catalog after the catalog has been created.
I'll approve this PR from the JS side and will let Merel, Nok, and others weigh in on the Python part.
I've left a couple of smaller comments that I hope you can get to before merging.
Well done! Looking forward to this being released.
Hi @tynandebold, @vladimir-mck. Thank you!
Thanks @ravi-kumar-pilla , the python tests & code look good now! 👍 ⭐
Where can we view that in action? I clicked on several nodes but didn't see anything overflowing.
You can check for datasets -
Hi @stichbury, I need your suggestion on documenting this feature once it is pushed in the new release.

What happens: Users can see the statistics (rows, columns, file size) for datasets which are instances of a pandas DataFrame. If a stat is not available, or the dataset is not an instance of a pandas DataFrame, users will see N/A for that stat. All this information is displayed under the Dataset statistics row in the metadata panel.

What should users do:
Step 1 -
Step 2 -
Note: A stats file (stats.json) will be created in the Kedro project folder.
Step 3 - The command opens a browser tab to serve the visualisation at http://127.0.0.1:4141/. Click on a dataset node and check the metadata panel. Users should see something like below -

@stichbury, is https://github.com/kedro-org/kedro/tree/main/docs/source/visualisation a good place to document this?
```python
try:
    with open("stats.json", "w", encoding="utf8") as file:
        sorted_stats_data = {
            dataset_name: stats_order(stats)
```
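The quoted excerpt truncates mid-comprehension. For context, a self-contained sketch of what the full write might look like — the input `stats_json` and the key order inside `stats_order` are hypothetical stand-ins, not the PR's actual values:

```python
import json
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the stats the hook has collected per dataset.
stats_json = {"companies": {"file_size": 1810602, "rows": 77096, "columns": 5}}


def stats_order(stats: dict) -> dict:
    """Return the stats keys in a fixed display order: rows, columns, file_size."""
    order = ("rows", "columns", "file_size")
    return {k: stats[k] for k in order if k in stats}


try:
    with open("stats.json", "w", encoding="utf8") as file:
        sorted_stats_data = {
            dataset_name: stats_order(stats)
            for dataset_name, stats in stats_json.items()
        }
        json.dump(sorted_stats_data, file)
except OSError as exc:
    logger.warning("Unable to write stats file: %s", exc)
```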
I found the name `stats_order` slightly confusing. Maybe `format_json` or `prettify_json`?
And can it just be a helper method that stays within the Hook class? I don't see why we want to put it in `utils.py`, since this is very specific to formatting the JSON produced by the Hook itself.
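As a sketch of the suggestion, the formatter could live on the hook class as a static method — the class and method names here are hypothetical, not Kedro-Viz's actual API:

```python
class DatasetStatsHook:
    """Illustrative hook class; only the formatting helper is shown."""

    _ORDER = ("rows", "columns", "file_size")

    @staticmethod
    def _format_stats(stats: dict) -> dict:
        # Emit known keys in a fixed display order, keeping any extras at the end.
        ordered = {k: stats[k] for k in DatasetStatsHook._ORDER if k in stats}
        ordered.update({k: v for k, v in stats.items() if k not in ordered})
        return ordered
```

Keeping it private to the class avoids a one-function `utils.py` module whose only caller is the hook.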
The same applies to the other functions. Can we do without `utils.py`?
package/kedro_viz/models/utils.py (Outdated)
```python
def get_file_size(file_path: Union[str, None]) -> Union[int, None]:
    """Get the dataset file size using fsspec. If the file_path is a directory,
    get the latest file created (this corresponds to the latest run)
```
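The quoted docstring describes the intended behaviour; a minimal local-filesystem sketch of that logic could look like the following. The real implementation uses fsspec to support remote filesystems, and "latest" is approximated here by modification time:

```python
from pathlib import Path
from typing import Optional, Union


def get_file_size(file_path: Union[str, None]) -> Optional[int]:
    """Size in bytes of file_path; for a directory, the size of the most
    recently modified file inside it (a stand-in for 'the latest run')."""
    if file_path is None:
        return None
    path = Path(file_path)
    if not path.exists():
        return None
    if path.is_dir():
        files = [p for p in path.iterdir() if p.is_file()]
        if not files:
            return None
        path = max(files, key=lambda p: p.stat().st_mtime)
    return path.stat().st_size
```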
Why can't we calculate this from the hook? It should be much simpler to do this if you have the `Dataset` object. That way you don't need to re-create all the `fsspec` logic.
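For illustration, reading the size straight off the dataset object might look like this — note that `_filepath` is a private attribute with no stability guarantee, and the helper name is hypothetical:

```python
import os
from typing import Optional


def file_size_from_dataset(dataset) -> Optional[int]:
    """Stat the file a dataset points at, if it exposes a _filepath attribute."""
    filepath = getattr(dataset, "_filepath", None)
    if filepath is None:
        return None
    try:
        return os.path.getsize(str(filepath))
    except OSError:
        return None
```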
I tried doing that using `_filepath` in the `after_context_created` hook. But since `AbstractDataset` does not have an API supporting that, you mentioned it would be fragile, so I thought it would be a good idea to move it here.
@noklam I have modified the hooks to include file_size. However, I am calculating the file_size only if the dataset instance is a pd.DataFrame. Earlier, I calculated it for every data node and transcoded data node containing a file_path. Please have a look and let me know if any changes are needed. Thank you!
Approved with non-blocking comments
```python
if isinstance(data, pd.DataFrame):
    self._stats[stats_dataset_name]["rows"] = int(data.shape[0])
    self._stats[stats_dataset_name]["columns"] = int(data.shape[1])

current_dataset = self.datasets.get(dataset_name, None)

if current_dataset:
    self._stats[stats_dataset_name]["file_size"] = self.get_file_size(
        current_dataset
    )
```
Good start! However, it only works with DataFrames. There are two additional cases that could easily be implemented:
- a `PartitionedDataSet` that loads `pd.DataFrame`s
- loading multiple sheets of an Excel file (`load_args: sheet_name: None`)

I think it's not too difficult to account for those cases by adding additional logic on `data`, and it would avoid just showing N/A in the UI.
Do you see how to implement that?
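Both of those cases surface as a dict of DataFrames once the partitions or sheets are materialised. A hedged sketch of the extra logic (the function name is hypothetical, and a real `PartitionedDataSet` load returns lazy callables that would need to be invoked first):

```python
from typing import Optional, Tuple

import pandas as pd


def extract_row_col_stats(data) -> Tuple[Optional[int], Optional[int]]:
    """Derive (rows, columns) for a DataFrame, or for a dict of DataFrames
    such as materialised partitions or a multi-sheet Excel load."""
    if isinstance(data, pd.DataFrame):
        return int(data.shape[0]), int(data.shape[1])
    if isinstance(data, dict) and data and all(
        isinstance(v, pd.DataFrame) for v in data.values()
    ):
        frames = list(data.values())
        total_rows = sum(int(f.shape[0]) for f in frames)
        # A single column count is only meaningful if all parts agree.
        cols = {int(f.shape[1]) for f in frames}
        return total_rows, cols.pop() if len(cols) == 1 else None
    return None, None  # anything else still falls back to N/A
```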
Nice! Thank you.
I'm not sure this fully resolves the request (#662). Ideally, we would want to see the dataset sizes directly in the graph view, so we can spot issues with the pipeline without having to click through each node in the graph. Even better if we had some way to set up rules to colour the nodes (e.g. if N=0, colour the node red). Please correct me in case I didn't fully understand the update. Added more details: #662 (comment)
Hi @lukaszdz, we are planning to integrate the statistics into the graph, which would resolve this issue completely. We are exploring how to approach this without cluttering the flowchart view. This is the first step towards visualizing the statistics in Kedro-Viz. Thank you!
Description
Resolves #662
Development notes
Screenshots:
All stats -
File size only -
File size not configured due to no file path -
Overflow Dataset stats -
QA notes
Steps to QA:
Note:
Checklist
RELEASE.md file